Real-time human performance capture and synthesis
Most of the images one finds in the media, such as on the Internet or in textbooks and magazines, contain humans as the main focus of attention. Thus, there is an inherent need for industry, society, and private individuals to be able to thoroughly analyze and synthesize the human-related content in these images. One aspect of this analysis, and a subject of this thesis, is inferring the 3D pose and surface deformation of humans using only visual information, a task also known as human performance capture. Human performance capture enables the tracking of virtual characters from real-world observations, which is key for visual effects, games, VR, and AR, to name just a few application areas. However, traditional capture methods usually rely on multi-view (marker-based) systems that are prohibitively expensive for the vast majority of people, or they use depth sensors, which are still not as common as single color cameras. Recently, some approaches have attempted to solve the task from a single RGB image alone. Nonetheless, they either cannot track the dense deforming geometry of the human, such as clothing layers, or they are far from real time, which many applications require. To overcome these shortcomings, this thesis proposes two monocular human performance capture methods, which for the first time allow the real-time capture of the dense deforming geometry as well as unprecedented 3D accuracy for pose and surface deformations. At the technical core, this work introduces novel GPU-based and data-parallel optimization strategies in conjunction with other algorithmic design choices that are all geared towards real-time performance at high accuracy. Moreover, this thesis presents a new weakly supervised multi-view training strategy combined with a fully differentiable character representation that achieves superior 3D accuracy.
However, there is more to human-related Computer Vision than the analysis of people in images. It is equally important to synthesize new images of humans in unseen poses and from camera viewpoints that have not been observed in the real world. Such tools are essential for the movie industry because they allow, for example, the synthesis of photo-realistic virtual worlds with real-looking humans, or of content that is too dangerous for actors to perform on set. Video conferencing and telepresence applications can also benefit from photo-real 3D characters, as they enhance the immersive experience of these applications. Here, the traditional Computer Graphics pipeline for rendering photo-realistic images involves many tedious and time-consuming steps that require expert knowledge and are far from real time: character rigging and skinning, the modeling of surface appearance properties, and physically based ray tracing. Recent learning-based methods attempt to simplify the traditional rendering pipeline and instead learn the rendering function from data, resulting in methods that are more easily accessible to non-experts. However, most of them model the synthesis task entirely in image space, such that 3D consistency cannot be achieved, and/or they fail to model motion- and view-dependent appearance effects. To this end, this thesis presents a method and ongoing work on character synthesis, which allow the synthesis of controllable, photoreal characters that achieve motion- and view-dependent appearance effects as well as 3D consistency, and which run in real time. This is technically achieved by a novel coarse-to-fine geometric character representation for efficient synthesis, which can be supervised solely on multi-view imagery. Furthermore, this work shows how such a geometric representation can be combined with an implicit surface representation to boost synthesis and geometric quality.
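A minimal sketch of the weakly supervised multi-view training idea mentioned above: a differentiable character model is fitted to 2D keypoint detections in several calibrated views, so no 3D ground truth is ever needed. All names and the pinhole projection are illustrative assumptions, not the thesis's actual components.

```python
import torch

def project(points_3d, K, R, t):
    """Pinhole projection of (N, 3) world points into one camera."""
    cam = points_3d @ R.T + t           # world -> camera coordinates
    uv = cam[:, :2] / cam[:, 2:3]       # perspective divide
    return uv @ K[:2, :2].T + K[:2, 2]  # apply intrinsics

def multiview_keypoint_loss(character, pose_params, cameras, detections_2d):
    """Mean reprojection error of predicted landmarks over all views.

    character(pose_params) -> (N, 3) landmark positions (differentiable).
    cameras: list of (K, R, t); detections_2d: list of (N, 2) tensors.
    """
    landmarks = character(pose_params)
    loss = 0.0
    for (K, R, t), det in zip(cameras, detections_2d):
        loss = loss + ((project(landmarks, K, R, t) - det) ** 2).mean()
    return loss / len(cameras)
```

Because every step is differentiable, gradients flow from the 2D reprojection error back into both the pose parameters and the character representation itself.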
ERC Consolidator Grant 4DRepL
LiveCap: Real-time Human Performance Capture from Monocular Video
We present the first real-time human performance capture approach that
reconstructs dense, space-time coherent deforming geometry of entire humans in
general everyday clothing from just a single RGB video. We propose a novel
two-stage analysis-by-synthesis optimization whose formulation and
implementation are designed for high performance. In the first stage, a skinned
template model is jointly fitted to background-subtracted input video, 2D and
3D skeleton joint positions found using a deep neural network, and a set of
sparse facial landmark detections. In the second stage, dense non-rigid 3D
deformations of skin and even loose apparel are captured based on a novel
real-time capable algorithm for non-rigid tracking using dense photometric and
silhouette constraints. Our novel energy formulation leverages automatically
identified material regions on the template to model the differing non-rigid
deformation behavior of skin and apparel. The two resulting non-linear
optimization problems per frame are solved with specially tailored,
data-parallel Gauss-Newton solvers. To achieve real-time performance
of over 25 Hz, we design a pipelined parallel architecture using the CPU and two
commodity GPUs. Our method is the first real-time monocular approach for
full-body performance capture. It yields accuracy comparable to that of
offline performance capture techniques while being orders of magnitude
faster.
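To make the per-frame solves concrete, here is a minimal, generic Gauss-Newton loop for a non-linear least-squares energy E(x) = ||r(x)||^2. The paper's solver is a hand-tuned, data-parallel GPU implementation; this NumPy sketch only illustrates the update rule, and all names are illustrative.

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, iters=10, damping=1e-6):
    """Minimize ||residual(x)||^2 starting from x0.

    residual: x -> (m,) residual vector
    jacobian: x -> (m, n) Jacobian of the residual
    """
    x = x0.astype(float)
    for _ in range(iters):
        r = residual(x)
        J = jacobian(x)
        # Solve the normal equations (J^T J + damping * I) dx = -J^T r.
        H = J.T @ J + damping * np.eye(x.size)
        dx = np.linalg.solve(H, -J.T @ r)
        x = x + dx
    return x
```

In a data-parallel GPU variant, the expensive parts of each iteration, assembling J^T J and J^T r over many residual terms, are computed by independent threads and reduced, which is what makes the per-frame budget of a real-time system reachable.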
NeuralClothSim: Neural Deformation Fields Meet the Kirchhoff-Love Thin Shell Theory
Cloth simulation is an extensively studied problem, with a plethora of
solutions available in computer graphics literature. Existing cloth simulators
produce realistic cloth deformations that obey different types of boundary
conditions. Nevertheless, their operational principle remains limited in
several ways: They operate on explicit surface representations with a fixed
spatial resolution, perform a series of discretised updates (which bounds their
temporal resolution), and require comparatively large amounts of storage.
Moreover, back-propagating gradients through the existing solvers is often not
straightforward, which poses additional challenges when integrating them into
modern neural architectures. In response to the limitations mentioned above,
this paper takes a fundamentally different perspective on physically-plausible
cloth simulation and re-thinks this long-standing problem: We propose
NeuralClothSim, i.e., a new cloth simulation approach using thin shells, in
which surface evolution is encoded in neural network weights. Our
memory-efficient and differentiable solver operates on a new continuous
coordinate-based representation of dynamic surfaces, i.e., neural deformation
fields (NDFs); it supervises NDF evolution with the rules of the non-linear
Kirchhoff-Love shell theory. NDFs are adaptive in the sense that they 1)
allocate their capacity to the deformation details as the latter arise during
the cloth evolution and 2) allow surface state queries at arbitrary spatial and
temporal resolutions without retraining. We show how to train our
NeuralClothSim solver while imposing hard boundary conditions and demonstrate
multiple applications, such as material interpolation and simulation editing.
The experimental results highlight the effectiveness of our formulation and its
potential impact.

Comment: 27 pages, 22 figures, and 3 tables; project page: https://4dqv.mpi-inf.mpg.de/NeuralClothSim
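The following sketch shows what a coordinate-based neural deformation field can look like: an MLP maps material coordinates (u, v) and time t to a 3D displacement, so the surface can be queried at arbitrary spatial and temporal resolution without retraining. The architecture and activation are assumptions for illustration, not the paper's exact network.

```python
import torch
import torch.nn as nn

class NeuralDeformationField(nn.Module):
    def __init__(self, hidden=128, layers=4):
        super().__init__()
        dims = [3] + [hidden] * layers + [3]  # input (u, v, t) -> displacement
        blocks = []
        for i in range(len(dims) - 1):
            blocks.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:
                # Smooth activation keeps the field differentiable, which
                # matters when supervising with a shell-energy loss.
                blocks.append(nn.Softplus())
        self.net = nn.Sequential(*blocks)

    def forward(self, uvt):                   # uvt: (..., 3)
        return self.net(uvt)

# Query the deformed surface at an arbitrary resolution:
ndf = NeuralDeformationField()
u, v = torch.meshgrid(torch.linspace(0, 1, 256),
                      torch.linspace(0, 1, 256), indexing="ij")
t = torch.full_like(u, 0.5)                   # an arbitrary query time
disp = ndf(torch.stack([u, v, t], dim=-1))    # (256, 256, 3) displacements
```

Training such a field against the Kirchhoff-Love shell energy (rather than a mesh-based solver) is the paper's core idea; the energy terms themselves are omitted here.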
Scene-Aware 3D Multi-Human Motion Capture from a Single Camera
In this work, we consider the problem of estimating the 3D position of
multiple humans in a scene as well as their body shape and articulation from a
single RGB video recorded with a static camera. In contrast to expensive
marker-based or multi-view systems, our lightweight setup is ideal for private
users as it enables an affordable 3D motion capture that is easy to install and
does not require expert knowledge. To deal with this challenging setting, we
leverage recent advances in computer vision using large-scale pre-trained
models for a variety of modalities, including 2D body joints, joint angles,
normalized disparity maps, and human segmentation masks. Thus, we introduce the
first non-linear optimization-based approach that jointly solves for the
absolute 3D position of each human, their articulated pose, their individual
shapes as well as the scale of the scene. In particular, we estimate the scene
depth and the unique scale of each person from normalized disparity predictions
using the 2D body joints and joint angles. Given the per-frame scene depth, we
reconstruct a point-cloud of the static scene in 3D space. Finally, given the
per-frame 3D estimates of the humans and scene point-cloud, we perform a
space-time coherent optimization over the video to ensure temporal, spatial and
physical plausibility. We evaluate our method on established multi-person 3D
human pose benchmarks where we consistently outperform previous methods and we
qualitatively demonstrate that our method is robust to in-the-wild conditions
including challenging scenes with people of different sizes.

Comment: Accepted to Eurographics 2023. See also GitHub: https://github.com/dluvizon/scene-aware-3d-multi-human; project page: https://vcai.mpi-inf.mpg.de/projects/scene-aware-3d-multi-human
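One step named above, reconstructing a static-scene point cloud from per-frame depth, can be sketched as follows. The intrinsics fx, fy, cx, cy are assumed known; recovering metric depth from normalized disparity via the body joints is the paper's contribution and is not reproduced here.

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) metric depth map into (H*W, 3) camera-space points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx                       # invert the pinhole model
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```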
Monocular Real-time Full Body Capture with Inter-part Correlations
We present the first method for real-time full body capture that estimates
shape and motion of body and hands together with a dynamic 3D face model from a
single color image. Our approach uses a new neural network architecture that
exploits correlations between body and hands at high computational efficiency.
Unlike previous works, our approach is jointly trained on multiple datasets
that focus on hand, body, or face separately, without requiring data in which
all parts are annotated at the same time, which would be much more difficult
to create with sufficient variety. Such multi-dataset training enables
superior generalization. In contrast to earlier monocular full-body
methods, our approach captures more expressive 3D face geometry and color by
estimating the shape, expression, albedo and illumination parameters of a
statistical face model. Our method achieves competitive accuracy on public
benchmarks, while being significantly faster and providing more complete face
reconstructions.
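As a rough illustration of the statistical face model mentioned above, the sketch below combines linear shape and expression bases for geometry, and per-vertex albedo with first-order spherical-harmonics illumination for color. Basis names and dimensions are placeholders, not the model actually used by the paper.

```python
import numpy as np

def face_geometry(mean, shape_basis, expr_basis, shape_p, expr_p):
    """Linear morphable-model geometry.

    mean: (N*3,) mean shape; shape_basis, expr_basis: (N*3, K) linear bases;
    shape_p, expr_p: (K,) coefficients. Returns (N, 3) vertices.
    """
    verts = mean + shape_basis @ shape_p + expr_basis @ expr_p
    return verts.reshape(-1, 3)

def shaded_color(albedo, normals, sh_coeffs):
    """First-order SH shading: albedo (N, 3), normals (N, 3), sh_coeffs (4, 3)."""
    basis = np.concatenate([np.ones((len(normals), 1)), normals], axis=1)
    return albedo * (basis @ sh_coeffs)  # per-vertex RGB
```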
ROAM: Robust and Object-aware Motion Generation using Neural Pose Descriptors
Existing automatic approaches for 3D virtual character motion synthesis
supporting scene interactions do not generalise well to new objects outside
training distributions, even when trained on extensive motion capture datasets
with diverse objects and annotated interactions. This paper addresses this
limitation and shows that robustness and generalisation to novel scene objects
in 3D object-aware character synthesis can be achieved by training a motion
model with as few as one reference object. We leverage an implicit feature
representation trained on object-only datasets, which encodes an
SE(3)-equivariant descriptor field around the object. Given an unseen object
and a reference pose-object pair, we optimise for the object-aware pose that is
closest in the feature space to the reference pose. Finally, we use l-NSM,
i.e., our motion generation model that is trained to seamlessly transition from
locomotion to object interaction with the proposed bidirectional pose blending
scheme. Through comprehensive numerical comparisons to state-of-the-art methods
and in a user study, we demonstrate substantial improvements in 3D virtual
character motion and interaction quality and robustness to scenarios with
unseen objects. Our project page is available at
https://vcai.mpi-inf.mpg.de/projects/ROAM/.

Comment: 12 pages, 10 figures; project page: https://vcai.mpi-inf.mpg.de/projects/ROAM
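The pose search described above can be sketched conceptually: given a frozen descriptor field f(object, pose) and the features of a reference pose-object pair, gradient descent finds the pose whose features best match the reference. Both descriptor_field and the pose parameterization below are stand-ins, not the paper's actual components.

```python
import torch

def optimise_pose(descriptor_field, new_object, ref_feat, pose_init,
                  steps=200, lr=1e-2):
    """Find the pose on a new object closest in feature space to ref_feat."""
    pose = pose_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        feat = descriptor_field(new_object, pose)  # features at current pose
        loss = torch.nn.functional.mse_loss(feat, ref_feat)
        loss.backward()
        opt.step()
    return pose.detach()
```

Because the descriptor field is SE(3)-equivariant, the matched pose transforms consistently with the object, which is what gives the method its robustness to unseen object placements.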